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Abstract. Reverse inference, or "brain reading", is a recent paradigm for analyzing functional magnetic res- 
onance imaging (fMRI) data, based on pattern recognition and statistical learning. By predicting some cognitive 
variables related to brain activation maps, this approach aims at decoding brain activity. Reverse inference takes 
into account the multivariate information between voxels and is currently the only way to assess how precisely some 
cognitive information is encoded by the activity of neural populations within the whole brain. However, it relies on 
a prediction function that is plagued by the curse of dimensionality, since there are far more features than samples, 
i.e., more voxels than fMRI volumes. To address this problem, different methods have been proposed, such as, 
among others, univariate feature selection, feature agglomeration and regularization techniques. In this paper, we 
consider a sparse hierarchical structured regularization. Specifically, the penalization we use is constructed from a 
tree that is obtained by spatially-constrained agglomerative clustering. This approach encodes the spatial structure 
of the data at different scales into the regularization, which makes the overall prediction procedure more robust to 
inter-subject variability. The regularization used induces the selection of spatially coherent predictive brain regions 
simultaneously at different scales. We test our algorithm on real data acquired to study the mental representation of 
objects, and we show that the proposed algorithm not only delineates meaningful brain regions but yields as well 
better prediction accuracy than reference methods. 

Key words, brain reading, structured sparsity, convex optimization, sparse hierarchical models, inter-subject 
validation, proximal methods. 

AMS subject classifications. - 

1. Introduction. Functional magnetic resonance imaging (or fMRI) is a widely used 
functional neuroimaging modality. Modeling and statistical analysis of fMRI data are com- 
monly done through a linear model, called general linear model (GLM) in the community, 
that incorporates information about the different experimental conditions and the dynamics 
of the hemodynamic response in the design matrix. The experimental paradigm consists of 
a sequence of stimuli, e.g., visual and auditory stimuli, which are included as regressors in 
the design matrix after convolution with a suitable hemodynamic filter. The resulting model 
parameters — one coefficient per voxel and regressor — are known as activation maps. They 
represent the local influence of the different experimental conditions on fMRI signals at the 
level of individual voxels. The most commonly used approach to analyze these activation 
maps is called classical inference. It relies on mass-univariate statistical tests (one for each 
voxel), and yields so-called statistical parametric maps (SPMs) fT9l . Such maps are useful for 
functional brain mapping, but classical inference has some limitations: it suffers from mul- 
tiple comparisons issues and it is oblivious of the multivariate structure of fMRI data. Such 
data exhibit natural correlations between neighboring voxels forming clusters with different 
sizes and shapes, and also between distant but functionally connected brain regions. 

To address these limitations, an approach called reverse inference (or "brain-reading") (T4j 
IT3l was recently proposed. Reverse inference relies on pattern recognition tools and statis- 
tical learning methods to explore fMRI data. Based on a set of activation maps, reverse 
inference estimates a function that can then be used to predict a target (typically, a variable 
representing a perceptual, cognitive or behavioral parameter) for a new set of images. The 
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challenge is to capture the correlation structure present in the data in order to improve the 
accuracy of the fit, which is measured through the resulting prediction accuracy. Many stan- 
dard statistical learning approaches have been used to construct prediction functions, among 
them kernel machines (SVM, RVM) 1 54 1 or discriminant analysis (LDA, QDA) G2). For the 
application considered in this paper, earlier performance results JT3H32) indicate that we can 
restrict ourselves to mappings that are linear functions of the data. 

Throughout the paper, we shall consider a training set composed of n pairs (x, y) G 
l p x 3^, where x denotes a p-dimensional fMRI signal (p voxels) and y stands for the target 
we try to predict. Each fMRI data point x will correspond to an activation map after GLM 
fitting. In the experiments we carry out in Section|5j we will encounter both the regression and 
the multi-class classification settings, where y denotes respectively the set of real numbers 
and a finite set of integers. An example of a regression setting is the prediction of a pain level 
from fMRI data l37l or in the context of classification, the prediction of object categories fl3l . 
Typical datasets consists of a few hundreds of measurements defined each on a 2 x 2 x 2 -mm 
voxels grid forming p « 10 5 voxels when working with full brain data. Such numbers, given 
as illustration, are not intrinsic limitation of MRI technology and are still regularly improved 
by experts in the field. 

In this paper, we aim at learning a weight vector w G R p and an intercept b G R 
such that the prediction of y can be based on the value of w T x + b. This is the case for 
the linear regression and logistic regression models that we use in Section [5] The scalar b 
is not particularly informative, however the vector w corresponds to a volume that can be 
represented in brain space as a volume for visualization of the predictive pattern of voxels. It 
is useful to rewrite these quantities in matrix form; more precisely, we denote by X G R nxp 
the design matrix assembled from n fMRI volumes and by y G W 1 the corresponding n 
targets. In other words, each row of X is a p-dimensional sample, i.e., an activation map of p 
voxels related to one stimulus presentation. 

Learning the parameters (w, b) remains challenging since the number of features (10 4 
to 10 5 voxels) exceeds by far the number of samples (a few hundreds of volumes). The 
prediction function is therefore prone to overfitting in which the learning set is predicted 
precisely whereas the algorithm provides very inaccurate predictions on new samples (the 
test set). To address this issue, dimensionality reduction attempts to find a low dimensional 
subspace that concentrates as much of the predictive power of the original set as possible for 
the problem at hand. 

Feature selection is a natural approach to perform dimensionality reduction in fMRI, 
since reducing the number of voxels makes it easier to identify a predictive region of the 
brain. This corresponds to discarding some columns of X. This feature selection can be uni- 
variate, e.g., analysis of variance (ANOVA) (33), or multivariate. While univariate methods 
ignore joint information between features, multivariate approaches are more adapted to re- 
verse inference since they extract predictive patterns from the data as a whole. However, due 
to the huge number of possible patterns, these approaches suffer from combinatorial explo- 
sion, and some costly suboptimal heuristics (e.g., recursive feature elimination [21 , 39]) can 
be used. That is why ANOVA is usually preferred in fMRI. Alternatively, two more adapted 
solutions have been proposed: regularization and feature agglomeration. 

Regularization is a way to encode a priori knowledge about the weight vector w. Pos- 
sible regularizers can promote for example spatial smoothness or sparsity which is a natural 
assumption for fMRI data. Indeed, only a few brain regions are assumed to be significantly 
activated during a cognitive task. Previous contributions on fMRI-based reverse inference 
include (6l|5T][52||64l. They can be presented through the following minimization problem: 

min £(y,X,w,6) + Afi(w) with A > 0, (1.1) 

(w,6)GRP +1 
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where AO(w) is the regularization term, typically a non-Euclidean norm, and the fit to the 
data is measured through a convex loss function (w, b) \-t £(y, X, w, b) G R+. The choice 
of the loss function will be made more specific and formal in the next sections. The coef- 
ficient of regularization A balances the loss and the penalization term. In this notation, a 
common regularization term in reverse inference is the so-called Elastic net (671120) , which 
is a combined t\ and £2 penalization: 

p 

Afi(w) = Ai||w||i + A 2 ||w||l = { X ^j\ + A 2w|}. (1.2) 

For the squared loss, when setting Ai to 0, the model is called ridge regression, while when 
A2 = it is known as Lasso (58) or basis pursuit denoising (BPDN) [9]. The essential 
shortcoming of the Elastic net is that it does not take into account the spatial structure of 
the data, which is crucial in this context [43]. Indeed, due to the intrinsic smoothing of the 
complex metabolic pathway underlying the difference of blood oxygenation measured with 
fMRI [61 ], statistical learning approaches should be informed by the 3D grid structure of the 
data. 

In order to achieve dimensionality reduction, while taking into account the spatial struc- 
ture of the data, one can resort to feature agglomeration. Precisely, new features called 
parcels are naturally generated via averaging of groups of neighboring voxels exhibiting 
similar activations. The advantage of agglomeration is that no information is discarded a 
priori and that it is reasonable to hope that averaging might reduce noise. Although, this ap- 
proach has been successfully used in previous work for brain mapping (18] [57), existing work 
does typically not consider the supervised information (i.e., the target y) while exploring the 
parcels. A recent approach has been proposed to address this issue, based on a supervised 
greedy top-down exploration of a tree obtained by hierarchical clustering (42) . This greedy 
approach has proven to be effective especially for inter- subject analyzes, i.e., when training 
and evaluation sets are related to different subjects. In this context, methods need to be robust 
to intrinsic spatial variations that exist across subjects: although a preliminary co-registration 
to a common space has been performed, some variability remains between subjects, which 
implies that there is no perfect voxel-to- voxel correspondence between volumes. As a result, 
the performances of traditional voxel-based methods are strongly affected. Therefore, aver- 
aging in the form of parcels is a good way to cope with inter-subject variability. This greedy 
approach is nonetheless suboptimal, as it explores only a subpart of the whole tree. 

Based on these considerations, we propose to integrate the multi-scale spatial structure 
of the data within the regularization term ft, while preserving convexity in the optimization. 
This notably guarantees global optimality and stability of the obtained solutions. To this end, 
we design a sparsity-inducing penalty that is directly built from the hierarchical structure of 
the spatial model obtained by Ward's algorithm (62) using a contiguity-constraint [45]. This 
kind of penalty has already been successfully applied in several contexts, e.g., in bioinformat- 
ics, to exploit the tree structure of gene networks for multi-task regression (30) , in log-linear 
models for the selection of potential orders (53), in image processing for wavelet-based de- 
noising [ 3 , 28 , 49 1, and also for topic models (28) . Other applications have emerged in natural 
language (40) and audio processing (55). 

We summarize here the contributions of our paper: 
• We explain how the multi- scale spatial structure of fMRI data can be taken into 
account in the context of reverse inference through the combination of a spatially 
constrained hierarchical clustering procedure and a sparse hierarchical regulariza- 
tion. 
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• We provide a convex formulation of the problem and propose an efficient optimiza- 
tion procedure. 

• We conduct an experimental comparison of several algorithms and formulations on 
fMRI data and illustrate the ability of the proposed method to localize in space and 
in scale some brain regions involved in the processing of visual stimuli. 

The rest of the paper is organized as follows: we first present the concept of structured 
sparsity-inducing regularization and then describe the different regression/classification for- 
mulations we are interested in. After exposing how we handle the resulting large-scale convex 
optimization problems thanks to a particular instance of proximal methods — the forward- 
backward splitting algorithm, we validate our approach both in a synthetic setting and on a 
real dataset. 

2. Combining agglomerative clustering with sparsity-inducing regularizers. As sug- 
gested in the introduction, it is possible to construct a tree- structured hierarchy of new features 
on top of the original voxels using hierarchical clustering. Moreover, spatial constraints can 
be enforced in the clustering algorithm so that the underlying voxels corresponding to each 
of these features form localized spatial patterns on the brain similar to the ones we hope to re- 
trieve ifTOli . Once these features constructed, instead of selecting features in the tree greedily, 
we propose to cast the feature selection problem as supervised learning problem of the form 
One of the qualities of the greedy approach however is that it is only allowed to select 
potentially more noisy features, corresponding to smaller clusters, after the a priori more sta- 
ble features associated with ancestral clusters in the hierarchy have been selected. As we will 
show, it is possible to construct a convex regularizer ft that has the same property, i.e. that 
respects the hierarchy, and prioritizes the selection of features in the same way. Naturally, the 
regularizer has to be constructed directly from the hierarchical clustering of the voxels. 

2.1. Spatially-constrained hierarchical clustering. Starting from n fMRI volumes 
X = [x 1 , . . . ,x p ] G W ixp described by p voxels, we seek to cluster these voxels so as 
to produce a hierarchical representation of X. 

To this end, we consider hierarchical agglomerative clustering procedures l29l . These 
begin with every voxel x J representing a singleton cluster {j}, and at each iteration, a selected 
pair of clusters — according to a criterion discussed below — is merged into a single cluster. 
This procedure yields a hierarchy of clusters represented as a binary tree T (also often called 
a dendrogram) |29lL where each nonterminal node is associated with the cluster obtained by 
merging its two children clusters. Moreover, the root of the tree T is the unique cluster that 
gathers all the voxels, while the leaves are the clusters consisting of a single voxel. From now 
on, we refer to each nonterminal node of T as a parcel, which is the union of its children's 
voxels (see Figure |27T] ). 

Among different hierarchical agglomerative clustering procedures, we use the variance- 
minimizing approach of Ward's algorithm [62]. In short, two clusters are merged if the result- 
ing cluster minimizes the sum of squared differences of the fMRI signal within all clusters 
(also known as inertia criterion). More formally, at each step of the procedure, we merge the 
clusters c\ and C2 that minimize the criterion 

(X) Cl |||+^||x fc -(X) C2 |||) 

kec 2 

(2.1) 

where we have introduced the average vector (X) c = ^ X^e c x "^ • ^ n or der to take into 
account the spatial information, we also add connectivity constraints in the hierarchical clus- 



A(c 1)C2 )= J2 l|x J -(X> ClUC2 ||i-(£||x' 
(X) Cl -(X) C2 ||i, 



j<GciUc 2 

MM 



Cl + C2 
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tering algorithm, so that only neighboring clusters can be merged together. In other words, 
we try to minimize the criterion A(ci, C2) only for pairs of clusters which share neighbor- 
ing voxels (see Algorithm [TJ. This connectedness constraint is important since the resulting 
clustering is likely to differ from standard Ward's hierarchical clustering. 



FlG. 2.1. Example of a tree T when p = 5, 
with three voxels and two parcels. The parcel 2 is de- 
fined as the averaged intensity of the voxels {1, 2}, 
while the parcel 1 is obtained by averaging the val- 
ues of the voxels {1,2, 3}. In red dashed lines are 
represented the five groups of variables that com- 
pose Q. For instance, if the group containing the 
parcel 2 is set to zero, the voxels {1,2} are also ( and 
necessarily) zeroed out. Best seen in color. 




2.1.1. Augmented space of features. Based on the output of the hierarchical clustering 
presented previously, we define the following augmented space of variables (or features): 
instead of representing the n fMRI volumes only by its individual voxel intensities, we add 
to it a vector with levels of activation of each of the parcels at the interior nodes of the 
tree T, which we obtained from the agglomerative clustering algorithm. Since T has q = 
\T\=2p-l nodesQ the data obtained in the augmented space can be gathered in a matrix 
XeR nxq . 

In the following, the level of activation of each parcel is simply the averaged intensity 
of the voxels it is composed of (i.e., local averages) (T21 E21- This produces a multi-scale 
representation of the fMRI data that has the advantage of becoming increasingly invariant to 
spatial shifts of the encoding regions within the brain volume. We summarize the procedure 
to build the enlarged matrix X G W ixq in Algorithm 111 Let us now illustrate on the example 



Algorithm 1 Spatially-constrained agglomerative clustering and augmented feature space. 

Input: n fMRI volumes X = [x 1 , . . . , x p ] G W nxp described by p voxels. 
Output: n fMRI volumes X G W ixq in the augmented feature space. 
Initialization: C = {{j}; j G {1, . . . X = X. 
while \C\ > 1 do 

Find a pair of clusters c\ , c 2 G C which have neighboring voxels and minimize ( |2.1| ). 

C^C\{c u c 2 }. 
C^CU(ciUc 2 ). 

X<-[X, (X) Cl uc 2 ]. 
end while 
Return: X. 



of linear models (such as those we use in Section [5]) what are the implications of considering 
the augmented space of features. For a node j of T, we let denote by Pj C {1, . . . , p} the set 
of voxels of the corresponding parcel (or, equivalently, the set of leaves of the subtree rooted 



1 We can then identify nodes (and parcels) of T with indices in {1, . . . , q}. 
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at node j). In this notation, and for any fMRI volume x E I 9 in the augmented feature space, 
linear functions indexed by w £ R g take the form 



Xfc, 



q 1 P ^ 

/ w (x) = w T x = E w 4r^ E x *] = E [ E r 

j=l I 3 \ keP . k=1 jeAk I" . . 

where A k stands for the set of ancestors of a node k'mT (including itself). 

2.2. Hierarchical sparsity-inducing norms. In the perspective of inter- subject vali- 
dation, the augmented space of variables can be exploited in the following way: Since the 
information of single voxels may be unreliable, the deeper the node in T, the more variable 
the corresponding parcel's intensity is likely to be across subjects. This property suggests 
that, while looking for sparse solutions of we should preferentially select the variables 
near the root of T, before trying to access smaller parcels located further down in T. 

Traditional sparsity-inducing penalties, e.g., the £i-norm Sl(w) = Y%=i l w il> yield 
sparsity at the level of single variables Wj, disregarding potential structures — for instance, 
spatial — existing between larger subsets of variables. We leverage here the concept of struc- 
tured sparsity (3l|7l[^|24l[23[25l|4T][T6]|, where Q penalizes some predefined subsets, or 
groups, of variables that reflect prior information about the problem at hand. 

When these groups form a partition of the space of variables, the resulting penalty has 
been shown to improve the prediction performance and/or interpretability of the learned mod- 
els, provided that the block structure is relevant (e.g., see [60, 62 [35j [3TJ [56] [23] and refer- 
ences therein). 

If the groups overlap | 3, 66j[24l[26l[25l|36l, richer structures can then be encoded. In 
particular, we follow l66l who first introduced hierarchical sparsity-inducing penalties. Given 
a node j of T, we denote by gj C {1, . . . , q} the set of indices that record all the descendants 
of j in T, in cluding itself. In other words, gj contains the indices of the subtree rooted at j ; see 



Figure 2.1 If we now denote by Q the set of all g 3 ?, j G {!,..., q}, that is, Q — {gi , . . . , g q }, 



we can define our hierarchical penalty as 

,1/2 



^(w)^EKIl2 = E[E w l] 1/2 - (2-2) 



As formally shown in (26L ft is a norm on R g , and it promotes sparsity at the level of groups 
g G Q, in the sense that it acts as a £i-norm on the vector ( || || 2 )^ea • Regularizing by 
ft therefore causes some ||w p ||2 (and equivalently w g ) to be zeroed out for some g G Q. 
Moreover, since the groups g G Q represent rooted subtrees of T, this implies that if one 
node/parcel j G g is set to zero by ft, the same occurs for all its descendants 1 66 1 . To put it 
differently, if one parcel is selected, then all the ancestral parcels in T will also be selected. 
This property is likely to increase the robustness of the methods to voxel misalignments be- 
tween subjects, since large parcels will be considered for addition in the model before smaller 
ones. 

The family of norms with the previous property is actually slightly larger and we consider 
throughout the paper norms ft of the form lf66l : 

where ||w p || denotes either the £2 -norm ||w p ||2 or the ^ -norm Hw^H^ = maxj Gg |wj| and 
( r 1g)geG are (strictly) positive weights that can compensate for the fact that some features are 
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overpenalized as a result of being included in a larger number of groups than others. In light 
of the results from [28], we will see in Section]?] that a large class of optimization problems 



regularized by ft — as defined in (2.3 ) — can be solved efficiently. 



3. Supervised learning framework. In this section, we introduce the formulations we 
consider in our experiments. As further discussed in Section [5] the target y we try to pre- 
dict corresponds to (discrete) sizes of objects, i.e., a one-dimensional ordered variable. It is 
therefore sensible to address this prediction task from both a regression and a classification 
viewpoint. In the remainder of this section, we shall denote by {w*, 6*} (or {W*, b*}) a 
solution of the optimization problems we present below]^] For simplicity, the formulations 
we review next are all expressed in terms of a matrix X G R nxp with p-dimensional param- 
eters, but they are of course immediately applicable to the augmented data X G R nxq and 
g-dimensional parameters. 

3.1. Regression. In this first setting, we naturally consider the squared loss function, so 
that problem can be reduced to 

min -^||y - Xwf^ + Afi(w) with A > 0. 

Note that in this case, we have omitted the intercept b since we can center the vector y and 
the columns of X instead. Prediction for a new fMRI volume x is then simply performed by 
computing the dot product x T w*. 

3.2. Classification. We can look at our prediction task from a multi-class classification 
viewpoint. Specifically, we assume that y is a finite set of integers {1, . . . , c}, c > 2, and 
consider bot h mu lti-class and "one-versus-all" strategies l50l . We need to slightly extend the 
formulation To this end, we introduce the weight matrix W = [w 1 , . . . , w c ] G R pxc , 
composed of c weight vectors, along with a vector of intercepts b G M c . 

A standard way of addressing multi-class classification problems consists in using a 
multi-logit model, also known as multinomial logistic regression (see, e.g., l22l and refer- 
ences therein). In this case, class-conditional probabilities are modeled for each class by a 
softmax function, namely, given a fMRI volume x, the probability of having the k-th class 
label reads 

Probfa = fc|x; W,b) = ^r^/r + I for *e{l,...,c}. (3.1) 

L f =i eX P{ X W + b r| 

The parameters {W, b} are then learned by maximizing the resulting (conditional) log- 
likelihood, which leads to the following optimization problem: 

min -y\og\ye^^ k -^ y ^- h yA +AVll(w fc ) . 

heR c 1=1 k = l k=l 

Whereas the regularization term is separable with respect to the different weight vectors w^, 
the loss function induces a coupling in the columns of W. As a result, the optimization has 
to be carried out over the entire matrix W. In this setting, and given a new fMRI volume 
x, we make predictions by choosing the label that maximizes the class-conditional probabil- 



ities (3.1), that is, argmax^^ c }Prob(y = fc|x; W*,b*). 

m~Section[5j we consider another multi-class classification scheme. The "one-versus-all" 
strategy (OVA) consists in training c different (real- valued) binary classifiers, each one being 



2 In the absence of strong convexity, we cannot in general guarantee the uniqueness of w* (and W*). 
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trained to distinguish the examples in a single class from the observations in all remaining 
classes. In order to classify a new example, among the c classifiers, the one which outputs 
the largest (most positive) value is chosen. In this framework, we consider binary classifiers 
built from both the squared and the logistic loss functions. If we denote by Y G { — 1, l} nxc 
the indicator response matrix defined as = 1 if y^ = k and — 1 otherwise, we obtain 



and 



min — > HY*-Xw fc ||| + A> ^(w fc ), 

k—\ k=l 

n c c 

min -YT log [l + e -^K Twfc + b *)l + A V Q(w k ). 
bGRC i=i k=i k=i 

By invoking the same arguments as in Section [3TT] the vector of intercepts b is again omitted 
in the above problem with the squared loss. Moreover, given a new fMRI volume x, we 
predict the label k that maximizes the response x T [w*] fe among the c different classifiers. The 
case of the logistic loss function parallels the setting of the multinomial logistic regression, 
where each of the c "one-versus-all" classifiers leads to a class-conditional probability; the 
predicted label is the one corresponding to the highest probability. 

The formulations we have reviewed in this section can be solved efficiently within the 
same optimization framework that we now introduce. 

4. Optimization. The convex minimization problem {1.1) is challenging, since the penalty 
ft as defined in {2.3) is non-smooth and the number of variables to consider is large (we have 
q w 10 5 variables in the following experiments). These difficulties are well addressed by 
forward-backward splitting methods, which belong to the broader class of proximal methods. 
Forward-backward splitting schemes date back (at least) to (38] |34) and have been further 
analyzed in various settings (e.g., see EOEIEI); for a thorough review of proximal splitting 
techniques, we refer the interested readers to fiTTl . 

Our convex minimization problem can be handled well by such techniques since it is 
the sum of two semi-lower continuous, proper, convex functions with non-empty domain, and 
where one element — the loss function £(y, X, .) — is assumed differentiate with Lipschitz- 
continuous gradient (which notably covers the cases of the squared and simple/multinomial 
logistic functions, as introduced in Section [5]). 

To describe the principle of forward-backward splitting methods, we need to introduce 
the concept of proximal operator. The proximal operator associated with our regularization 
term Xft, which we denote by ProxA^, is the function that maps a vector w G W to the 
unique solution of 

1 

2 

This operator was initially introduced by Moreau l44l to generalize the projection operator 
onto a conv ex set; for a complete study of the properties of Prox^, see lfT2l . Based on defi- 
nition (4J_), and given the current iterate w^Jjthe typical update rule of forward-backward 
splitting methods has the forrrj^] 

w (*+i) ^ Prox£ n (w<*° - -^V£w(y,X, w<*>)), (4.2) 



min -|| v -w||| + Afl(v). (4.1) 



3 For clarity of the presentation, we do not consider the optimization of the intercept that we leave unregularized 
in all our experiments. 

4 For simplicity, we only present a constant-stepsize scheme; adaptive line search can also be used in this context 
and can lead to larger stepsizes flTI . 
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where L > is a parameter which is a upper bound on the Lipschitz constant of the gradient 



of C. In the light of the update rule (4.2 ), we can see that solving efficiently problem (4.1 ) is 



crucial to enjoy good performance. In addition, when the non-smooth term ft is not present, 



the previous proximal problem (4.2), also known as the implicit or backward step, exactly 



leads to the standard gradient update rule. 

For many regularizations ft of interest, the solution of problem ( |4.1| ) can actually be 
computed in closed form in simple settings: in particular, when ft is the ^i-norm, the proximal 
operator is the well-known soft-thresholding operator fT5l . The work of l28l recently showed 
that the proximal problem ( |4.1| ) could be solved efficiently and exactly with ft as defined in 
p.3| ). The underlying idea of this computation is to solve a well-ordered sequence of simple 
proximal problems associated with each of the terms \\w g || for g G Q. We refer the interested 
readers to 1 28 ] for further details on this norm and to [ 2 ] for a broader view. 

In the subsequent experiments, we focus on accelerated multi-step versions of fo rward- 
backward splitting methods (see, e.g., I46ll4ll63ll){^] where the proximal problem {A2) is not 



solved for a current estimate, but for an auxiliary sequence of points that are linear combina- 
tions of past estimates. These accelerated versions have increasingly drawn the attention of a 
broad research community since they can deal with large non- smooth convex problems, and 
their convergence rates on the objective achieve the complexity bound of 0(l/k 2 ), with k 
denoting the iteration number. As a side comment, note that as opposed to standard one-step 
forward-backward splitting methods, nothing can be said about the convergence of the se- 
quence of iterates themselves. In our case, the cost of each iteration is dominated by the com- 
putation of the gradient (e.g., 0(np) for the squared loss) and the proximal operator, whose 
time complexity is linear, or close to linear, in p for the tree- structured regularization (28). 

5. Experiments and results. We now present experimental results on simulated data 
and real fMRI data. 

5.1. Simulations. In order to illustrate the proposed method, the hierarchical regular- 
ization with the £2 -norm and rj 9 = 1 for all g G Q was applied in a regression setting on 
a small two-dimensional simulated dataset consisting of 300 square images (40 x 40 pixels 
i.e. X G 2£ 300x1600 ). The weight vector w used in the simulation — itself an image of the 
same dimension — is presented in Fig. |5.1| -a. It consists of three localized regions of two 
different sizes that are predictive of the output. The images are sampled so as to obtain 
a correlation structure which mimics fMRI data. Precisely, each image x^) was obtained by 
smoothing a completely random image — where each pixel was drawn i.i.d from a normal 
distribution — with a Gaussian kernel (standard deviation 2 pixels), which introduces spatial 
correlations between neighboring pixels. Subsequently, additional correlations between the 
regions corresponding to the three patterns were introduced in order to simulate co- activations 
between different brain regions, by multiplying the signal by the square-root of an appropriate 
covariance matrix E. Specifically, X) G ]R 1600x1600 i s a spatial covariance between voxels, 
with diagonal set to = 1 for all i, and with two off-diagonal blocks. Let us denote C\ and 
C2 the set of voxels forming the two larger patterns, and C3 the voxels in the small pattern. 
The covariance coefficients are set to X^j =0.3 for z G Ci and j G C2, and X^j = —0.2 for 
i G C2 and j G C3. The covariance is of course symmetric. 

The choice of the weights and of the correlation introduced in images aim at illustrat- 
ing how the hierarchical regularization estimates weights at different resolutions in the im- 
age. The targets were simulated by forming w T x^) corrupted with an additive white noise 



(SNR=10dB). The loss used was the squared loss as detailed in Section 3.1 The regular- 



5 More precisely, we use the accelerated proximal gradient scheme (FISTA) taken from 0. The Mat lab/ C++ 
implementation we use is available at http : / /www . di . ens . f r/willow/SPAMS/. 
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FlG. 5.1. Weights w* estimated in the simulation study. The true coefficients are presented in a) and the 
estimated weights at different scales, i.e., different depths in the tree, are presented in b)-h) with the same colormap. 
The results are best seen in color. 



ization parameter was estimated with two-fold cross-validation (150 images per fold) on a 
logarithmic grid of 30 values between 10 3 and 10 -3 . 

The components of the estimated weight vector w* at different scales are presented in 



the images of Fig. 5.1 , with each image corresponding to a different depth in the tree. For 
a given tree depth, an image is formed from the corresponding parcellation. All the voxels 
within a parcel are colored according to the associated scalar in w* . It can be observed that 
all three patterns are present in the weight vector but at different depths in the tree. The small 
activation in the top-right hand corner shows up mainly at scale 3 while the bigger patterns 
appear higher in the tree at scales 5 and 6. This simulation clearly illustrates the ability of the 
method to capture informative spatial patterns at different scales. 

5.2. Description of the fMRI data. We apply the different methods to analyze the data 
of ten subjects from an fMRI study originally designed to investigate object coding in visual 
cortex (see ifTTl for details). During the experiment, ten healthy volunteers viewed objects 
of two categories (each one of the two categories is used in half of the subjects) with four 
different exemplars in each category. Each exemplar was presented at three different sizes 
(yielding 12 different experimental conditions per subject). Each stimulus was presented four 
times in each of the six sessions. We averaged data from the four repetitions, resulting in a 
total of n = 72 volumes by subject (one volume of each stimulus by session). Functional 
volumes were acquired on a 3-T MR system with eight-channel head coil (Siemens Trio, 
Erlangen, Germany) as T2* -weighted echo-planar image (EPI) volumes. Twenty transverse 
slices were obtained with a repetition time of 2s (echo time, 30ms; flip angle, 70°; 2x2x2- 
mm voxels; 0.5-mm gap). Realignment, spatial normalization to MNI space, slice timing 
correction and GLM fit were performed with the SPM5 softwar^] In the GLM, the time 
course of each of the 12 stimuli convolved with a standard hemodynamic response function 
was modeled separately, while accounting for serial auto-correlation with an AR(1) model 
and removing low-frequency drift terms with a high-pass filter with a cut-off of 128s (7.8 x 
10 _3 Hz). In the present work we used the resulting session- wise parameter estimate volumes. 



6 http : / /www .fil.ion.ucl.ac. uk/ spm/ software/ spm5. 
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Contrary to a common practice in the field the data were not smoothed with an isotropic 
Gaussian filter. All the analysis are performed on the whole acquired volume. 

The four different exemplars in each of the two categories were pooled, leading to vol- 
umes labeled according to the three possible sizes of the object. By doing so, we are interested 
in finding discriminative information to predict the size of the presented object. 

This can be reduced to either a regression problem in which our goal is to predict the class 
label of the size of the presented object (i.e., y G {0,1, 2})Qor a three-category classification 
problem, each size corresponding to a category. We perform an inter- subject analysis on the 
sizes both in regression and classification settings. This analysis relies on subject- specific 
fixed-effects activations, i.e., for each condition, the six activation maps corresponding to 
the six sessions are averaged together. This yields a total of 12 volumes per subject, one 
for each experimental condition. The dimensions of the real data set are p « 7 x 10 4 and 
n = 120 (divided into three different sizes). We evaluate the performance of the method by 
cross-validation with a natural data splitting, leave-one-subject-out. Each fold consists of 12 
volumes. The parameter A of all methods is optimized over a grid of 30 values of the form 
2 fe , with a nested leave-one- subject-out cross-validation on the training set. The exact scaling 
of the grid varies for each model to account for different ft. 

5.3. Methods involved in the comparisons. In addition to considering standard t\- 
and squared ^-regularizations in both our regression and multi-class classification tasks, we 
compare various methods that we now review. 

First of all, when the regularization ft as defined in p.3| ) is employed, we consider three 
settings of values for (f] g ) ge g which leverage the tree structure T. More precisely, we set 
rj g = p de P th (#) for g in Q, with p G {0.5, 1, 1.5} and where depth(g) denotes the depth of the 
root of the group g in T. In other words, the larger p, the more averse we are to selecting 
small (and variable) parcels located near the leaves of T. As the results illustrate it, the 
choice of p can have a significant impact on the performance. More generally, the problem of 
selecting p properly is a difficult question which is still under investigation, both theoretically 
and practically, e.g., see Q. 

The greedy approach from [42) is included in the comparisons, for both the regression 
and classification tasks. It relies on a top-down exploration of the tree T. In short, start- 
ing from the root parcel that contains all the voxels, we choose at each step the split of the 
parcel that yields the highest prediction score. The exploration step is performed until a 
given number of parcels is reached, and yields a set of nested parcellations with increasing 
complexity. Similarly to a model selection step, we chose the best parcellation among those 
found in the exploration step. The selected parcellation is thus used on the test set. In the 
regression setting, this approach is combined with Bayesian ridge regression, while it is as- 
sociated with a linear support vector machine for the classification task (whose regularization 
parameter, often referred to as C in the literature 1541 , is found by nested cross-validation in 
{0.01,0.1,1}). 

5.3.1. Regression setting. In order to evaluate whether the level of sparsity is critical 
in our analysis, we implemented a reweighted t\ -scheme |5|. In this case, sparsity is encour- 
aged more aggressively as a multi-stage convex relaxation of a concave penalty. Specifically, 
it consists in using iteratively a weighted ^i-norm, whose weights are determined by the so- 
lution of previous iteration. Moreover, we additionally compare to Elastic net l67lL whose 
second regularization parameter is set by cross-validation as a fraction of A, that is, aX with 
a G {0.5,0.05,0.005,0.0005}. 



7 An interesting alternative would be to consider some real-valued dimension such as the field of view of the 
object. 
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To better understand the added value of the hierarchical norm ( |2.3| ) over unstructured 
penalties, we not only consider the plain ^i-norm in the augmented feature space, but also 
another variant of weighted ^i-norm. The weights are manually set and reflect the under- 
lying tree structure T. By analogy with the choice of (f] g ) ge g made for the tree- structured 
regularization, we take exponential weights depending on the depth of the variable j, where 
rjj = p de P th W with p e {0.5,1.5}^ We also tried weights (rjj)je{i,..., P } that are linear 
with respect to the depths, i.e., rjj = maXfc£{ d i epth ^ depth (/ c ) > but those led to worse results. In 
Table |5.l[ we only present the best result of this weighted ^i-norm, obtained with the ex- 
ponential weight and p = 1.5. We now turn to the models taking part in the classification 
task. 



5.3.2. Classification setting. As discussed in Section [X2| the optimization in the clas- 
sification setting is carried out over a matrix of weights W G M pxc . This makes it possible 
to consider regularization schemes that couple the selection of variables across rows of that 
matrix. 

In particular, we apply ideas from multi-task learning [47 ] by viewing each class as a 
task. More precisely, we use a regularization norm defined by ^multi-task (W) = Y^=i II Wj || , 
where ||Wj || denotes either the £2- or ^-norm of the j-th row of W. The rationale for the 
definition of ^ mu iti-task is to assume that the set of relevant voxels is the same across the c 
different classes, so that sparsity is induced simultaneously over the columns of W. It should 
be noted that , in the "one-versus-all" setting, although the loss functions for the c classes 
are decoupled, the use of ^multi-task induces a relationship that ties the optimization problems 
together. 

Note that the tree-structured regularization Q we consider does not impose a joint pattern- 
selection across the c different classes. It would however be possible to use ft over the matrix 
W, in a multi-task setting. More precisely, we would define fi(W) = ^2 ge g ||W P ||, where 
1 1 1 1 denotes either the £2 - or ^ -norm of the vectorized sub-matrix = [Wj k]jeg,ke{i,... 
This definition constitutes a direct extension of the standard non-overlapping £\/£2- and 
-norms used for multi-task. Furthermore, it is worth noting that the optimization tools 
from l28l would still apply for this tree- structured matrix regularization. 

5.4. Results. We present results comparing our approach based on the hierarchical 
sparsity-inducing norm ( |2.3| ) with the regularization listed in the previous section. For each 
method, we computed the cross- validated prediction accuracy and the percentage of non-zero 
coefficients, i.e., the level of sparsity of the model. 

5.4.1. Regression results. The results for the inter-subject regression analysis are re- 
ported in Table |5.1| The lowest error in prediction accuracy is obtained by both the greedy 
strategy and the proposed hierarchical structured sparsity approach (Tree £2 with p = 1) 
whose performances are essentially indistinguishable. Both also yield one of the lowest stan- 
dard deviation indicating that the results are most stable. This can be explained by the fact 
that the use of local signal averages in the proposed algorithm is a good way to get some 
robustness to inter-subject variability. We also notice that the sparsity-inducing approaches 
(Lasso and reweighted £\) have the highest error in prediction accuracy, probably because 
the obtained solutions are too sparse, and suffer from the absence of perfect voxel-to- voxel 
correspondences between subjects. 

In terms of sparsity, we can see, as expected, that ridge regression does not yield any 
sparsity and that the Lasso solution is very sparse (in the feature space, with approximately 



8 Formally, the depth of the feature j is equal to depth (gj), where gj is the smallest group in Q that contains j 
(smallest is understood here in the sense of the inclusion). 
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Loss function: Squared loss 




Error (mean,std) 


P-value w.r.t. Tree £2 (p = 1) 


Median fraction of non-zeros (%) 


Regularization: 








£2 (Ridge) 


(13.8,7.6) 


0.096 


100.00 


h 


(20.2, 10.9) 


0.013* 


0.11 


£1 + h (Elastic net) 


(14.4, 8.8) 


0.065 


0.14 


Reweighted i\ 


(18.8, 14.6) 


0.052 


0.10 


i\ (augmented space) 


(14.2, 7.9) 


0.096 


0.02 


i\ (tree weights) 


(13.9,7.9) 


0.032* 


0.02 


Tree £ 2 (p = 0.5) 


(13.0, 7.4) 


0.137 


99.99 


Tree £ 2 (p = 1) 


(11.8, 6.7) 




9.36 


Tree £ 2 (p = 1.5) 


(13.5,7.0) 


0.080 


0.04 


Tree £qo (p = 0.5) 


(13.6,7.8) 


0.080 


99.99 


Tree (p = 1) 


(12.8, 6.7) 


0.137 


1.22 


Tree (p = 1.5) 


(13.0, 6.8) 


0.096 


0.04 






Error (mean,std) 


P-value w.r.t. Tree £2 (p = 1) 


Median fraction of non-zeros (%) 


Greedy 


(12.0, 5.5) 


0.5 


0.01 



Table 5.1 



Prediction results obtained on fMRI data (see text) for the regression setting. From the left, the first column 
contains the mean and standard deviation of the test error (mean squared error), computed over leave-one-subject- 
out folds. The best performance is obtained with the greedy technique and the hierarchical £ 2 penalization (p — 1 ) 
constructed from the Ward tree. Methods with performance significantly worse than the latter is assessed by Wilcoxon 
two-sample paired signed rank tests (The superscript * indicates a rejection at 5%). Levels of sparsity reported are 
in the augmented space whenever it is used. 



7 x 10 4 voxels). Our method yields a median value of 9.36% of non-zero coefficients (in 
the augmented space of features, with about 1.4 x 10 5 nodes in the tree). The maps of 
weights obtained with Lasso and the hierarchical regularization for one fold, are reported 



in Fig. |5.2| The Lasso yields a scattered and overly sparse pattern of voxels, that is not 
easily readable, while our approach extracts a pattern of voxels with a compact structure, 
that clearly outlines brain regions expected to activate differentially for stimuli with different 
low-level visual properties, e.g., sizes; the early visual cortex in the occipital lobe at the back 
of the brain. Interestingly, the patterns of voxels show some symmetry between left and right 
hemispheres, especially in the primary visual cortex which is located at the back and center of 
the brain. This observation is consistent with the current understanding in neuroscience that 
the symmetric parts of this brain region process respectively the visual contents of each of the 
visual hemifields. The weights obtained at different depth level in the tree, corresponding to 
different scales, show that the largest coefficients are concentrated at the higher scales (scale 



6 in Fig. 5.2), which suggest that the object sizes cannot be well decoded at the voxel level 



but require features corresponding to larger clusters of voxels. 

5.4.2. Classification results. The results for the inter-subject classification analysis are 



reported in Table |5.2| The best performance is obtained with a multinomial logistic loss 
function, also using the hierarchical £2 penalization (p = 1). 

It should be noted that the sparsity level of the different model estimated vary widely 
depending on the loss and regularization used. With the squared loss, t\ type regularization, 
including the multi-task regularizations based on the t\jl2 and ^i/ioo norm tend to select 
quite sparse models, which keep around 0.1% of the voxels. When using logistic type losses, 
these regularizations tend to select a significantly large number of voxels, which suggests that 
the selection problem is really difficult and that these methods are unstable. For the methods 
with hierarchical regularization, on the contrary, the sparsity tends to improve with the choice 
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of loss functions that are better suited to the classification problem and tuning p trades off 
smoothly sparsity of the model against performance, from models that are not sparse when 
p is small to very sparse models when p is large. In particular a better compromise between 
sparsity and prediction performance can probably be obtained by tuning p G [1, 1.5]. 

For both £i and hierarchical regularizations, one of the three vectors of coefficients ob- 



tained for one fold and for the loss leading to sparser models are presented in Fig. |5.3| For 
£i, the active voxels are scattered all over the brain, and for other losses than the squared-loss 
the models selected tend not to be sparse. By contrast, the tree £2 regularization yields clearly 
delineated sparsity patterns located in the visual areas of the brain. Like for the regression 
results, the highest coefficients are obtained at scale 6 showing how spatially extended is 
the brain region involved in the cognitive task. The symmetry of the pattern at this scale is 
also particularly striking in the primary visual areas. It also extends more anteriorly into the 
inferior temporal cortex, known for high-level visual processing. 

6. Conclusion. In this article, we introduced a hierarchically structured regularization, 
which takes into account the spatial and multi-scale structure of fMRI data. This approach 
copes with inter-subject variability in a similar way as feature agglomeration, by averaging 
neighboring voxels. Although alternative agglomeration strategies do exist, we simply used 
the criterion which appears as the most natural, Ward's clustering, and which builds parcels 
with little variance. 

Results on a real dataset show that the proposed algorithm is a promising tool for min- 
ing fMRI data. It yields similar or higher prediction accuracy than reference methods, and 
the map of weights it obtains exhibit a cluster-like structure. It makes them easily readable 
compared to the overly sparse patterns found by classical sparsity-promoting approaches. 

For the regression problem, both the greedy method from [42 ] and the proposed algo- 
rithm yield better results than unstructured and non-hierarchical regularizations, whereas in 
the classification setting, our approach leads to the best performance. Moreover, our pro- 
posed methods enjoy the different benefits from convex optimization. In particular, while 
the greedy algorithm relies on a two-step approach that may be far from optimal, the hierar- 
chical regularization induces simultaneously the selection of the optimal parcellation and the 
construction of the optimal predictive model, given the initial hierarchical clustering of the 
voxels. Moreover, convex methods yield predictors that are essentially stable with respect to 
perturbations of the design or the initial clustering, which is typically not the case of greedy 
methods. It is important to distinguish here the stability of the predictors from that of the only 
learned map w, which could be enforced via a squared ^ 2 -norm regularization. 

Finally, it should be mentioned that the performance achieved by this approach in inter- 
subject problems suggests that it could potentially be used successfully for medical diagnosis, 
in a context where brain images -not necessarily functional images- are used to classify indi- 
viduals into diseased or control population. Indeed, for difficult problems of that sort, where 
the reliability of the diagnostic is essential, the stability of models obtained from convex 
formulations and the interpretability of sparse and localized solutions are quite relevant to 
provide a credible diagnostic. 
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FlG. 5.2. Maps of weights obtained using different regularizations in the regression setting, (a) i\ regulariza- 
tion - We can notice that the predictive pattern obtained is excessively sparse, and is not easily readable despite being 
mainly located in the occipital cortex, (b-d) tree £2 regularization (p = 1) at different scales - In this case, the reg- 
ularization algorithm extracts a pattern of voxels with a compact structure, that clearly outlines early visual cortex 
which is expected to discriminate between stimuli of different sizes. 3D images were generated with Mayavi H48H . 
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Loss function: Squared loss ("one-versus-all") 




Error (mean,std) 


P-value w.r.t. Tree t 2 (p = 1)-ML 


Median fraction of non-zeros (%) 


Regularization: 
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Loss function: Logistic loss ("one-versus-all") 




Error (mean,std) 


P-value w.r.t. Tree t 2 (p = 1)-ML 


Median fraction of non-zeros (%) 


Regularization: 
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Loss function: Multinomial logistic loss (ML) 




Error (mean,std) 


P-value w.r.t. Tree t 2 (p = 1)-ML 


Median fraction of non-zeros (%) 


Regularization: 
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0.01 



Table 5.2 



Prediction results obtained on fMRI data (see text) for the multi-class classification setting. From the left, the 
first column contains the mean and standard deviation of the test error (percentage of misclassification), computed 
over leave-one-subject-out folds. The best performance is obtained with the hierarchical i 2 penalization (p = 1 ) 
constructed from the Ward tree, coupled with the multinomial logistic loss function. Methods with performance sig- 
nificantly worse than this combination is assessed by Wilcoxon two-sample paired signed rank tests (The superscript 
* indicates a rejection at 5%). Levels of sparsity reported are in the augmented space whenever it is used. 
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FlG. 5.3. Maps of weights obtained using different regularizations in the classification setting, (a) t\ regu- 
larization (with squared loss) - We can notice that the predictive pattern obtained is excessively sparse, and is not 
easily readable with voxels scattered all over the brain, (b-d) tree regularization (with multinomial logistic loss) at 
different scales - In this case, the regularization algorithm extracts a pattern of voxels with a compact structure, that 
clearly outlines early visual cortex which is expected to discriminate between stimuli of different sizes. 



